Predicting AirBNB Costs in New York City (Statistical Modelling)
Name(s) & ID(s) of Group Members:
Table of Contents - Phase 2
Phase 1 Summary
Report Overview
Overview of Methodology
Model Overview
One-Hot-Encoding
Model Fitting with OLS
Diagnostic Checks for Full Model
Backward Feature Selection (Reduced Model)
Diagnostic Checks for Reduced Model
Project Summary
Summary of Findings
Conclusions
Phase 1 Summary
This project has two phases, and Phase 1 is now complete. The goal of Phase 1 was to perform data preprocessing and data exploration on the dataset of New York Airbnb vacation rental prices. We took price as our target and compared it with other key factors, to help tourists and property owners see how prices range across their area and the whole of New York City.
In Phase 1 the dataset was cleaned by removing outliers and dropping columns that were unnecessary or unsuitable for machine learning. The target feature was set to 'price_per_night'. Furthermore, a range of visualisations was plotted to help identify trends and correlations between the target feature and the descriptive features of the dataset.
The findings revealed that the number of host listings, the room type and the borough (neighbourhood group) of a listing had a significant impact on price. We also found that several other variables, such as availability, total reviews and reviews per month, appeared to have little effect on price (our target variable). Overall, the borough (neighbourhood group) tended to be the strongest indicator of the price of an Airbnb rental.
Report Overview
This report uses price_per_night from the Airbnb dataset as the target feature and predicts its value from other suitable features, both numerical and categorical. The cleaned and preprocessed data are fitted to a statistical model using the Python module statsmodels, and the results are visualised. Moreover, aspects of the model, such as the residual plots, are analysed to assess how suitable the chosen statistical model is for describing the data.
Overview of Methodology
The dataset investigated covers New York City and describes the city's Airbnb usage through various metrics. The aim of the statistical modelling was to explore a topic of our choosing, with the requirement that the dataset must support multi-linear regression analysis.
Multi-linear regression uses multiple explanatory (independent) variables to model a single response variable. This is useful because it shows how well the independent variables, taken together, explain the response. Furthermore, it shows how each independent variable affects the response individually, with the other variables held fixed. Multi-linear regression therefore describes both the overall relationship between the response and the set of independent variables and the contribution of each one.
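Written out, the model described above takes the following general form. The symbols ($y_i$, $x_{ij}$, $\beta_j$, $\varepsilon_i$, $n$, $p$) are notation introduced here for illustration and are not taken from the original report.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i,
\qquad i = 1, \dots, n

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n}
\left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2
```

Here $y_i$ would be price_per_night for listing $i$, each $x_{ij}$ one numerical feature or one-hot-encoded indicator, and $\varepsilon_i$ the residual error. The second line is the ordinary least squares criterion: the coefficients are chosen to minimise the sum of squared residuals.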
Our plan was to include all the features of the dataset that significantly aid in predictive modelling. By one-hot-encoding our dataset, including the neighbourhood feature, we will end up with hundreds of features; we expect to use neighbourhoods with the lowest p values to increase the accuracy of our statistical modelling solution.
We then fit the data to an OLS regression model. We use the model summary to check the association between our independent variables and the target. We then check the four assumptions of a multiple linear regression model: residual normality, constant residual variance, residual independence and a linear relationship between the predictors and the response.
To reduce our model, we sort the features by p-value, remove those with the highest p-values and refit the regression, repeating until every remaining feature has a p-value below 0.05.
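The reduction loop described above can be sketched as follows. This is a minimal illustration, not the notebook's own code: `backward_eliminate`, `fit_pvalues` and `fake_fit` are hypothetical names, the sketch assumes one feature is removed per refit, and the p-values in `fake_fit` are fixed toy numbers (in a real run they would change after every refit of the OLS model).

```python
from typing import Callable, Dict, List


def backward_eliminate(
    features: List[str],
    fit_pvalues: Callable[[List[str]], Dict[str, float]],
    alpha: float = 0.05,
) -> List[str]:
    """Drop the least significant feature until all p-values are below alpha."""
    kept = list(features)
    while kept:
        pvalues = fit_pvalues(kept)  # refit on the current feature set
        worst = max(kept, key=lambda name: pvalues[name])
        if pvalues[worst] < alpha:   # every remaining feature is significant
            break
        kept.remove(worst)           # remove one feature, then refit
    return kept


# Toy stand-in for a real model fit (assumption): fixed, made-up p-values.
def fake_fit(names: List[str]) -> Dict[str, float]:
    made_up = {
        "availability_365": 0.001,
        "reviews_per_month": 0.108,
        "calculated_host_listings_count": 0.628,
    }
    return {name: made_up[name] for name in names}


selected = backward_eliminate(
    ["availability_365", "reviews_per_month", "calculated_host_listings_count"],
    fake_fit,
)
print(selected)
```

Passing the fitting step in as a function keeps the loop independent of any particular modelling library; with statsmodels, that function would refit the formula on the kept features and return the fitted model's p-values.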
The categorical features of the preprocessed dataset will be converted to binary indicator columns, to comply with the requirements of the statsmodels fitting function.
We will import all the libraries required for the statistical modelling and check the shape of the DataFrame to ensure we are within the 5000-row limit and have the required columns.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy
import scipy.stats as st
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")
sns.set(rc = {'figure.figsize':(10,10)})
df_airbnb = pd.read_csv("Phase2_Group61.csv")
df_airbnb.shape
(5000, 9)
df_airbnb.head(10)
| | neighbourhood_group | neighbourhood | room_type | price_per_night | minimum_nights_to_book | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Manhattan | East Village | Entire home/apt | 151 | 2 | 214 | 2.53 | 3 | 256 |
| 1 | Manhattan | Lower East Side | Entire home/apt | 220 | 2 | 146 | 3.45 | 1 | 112 |
| 2 | Queens | Elmhurst | Private room | 41 | 2 | 44 | 5.24 | 1 | 123 |
| 3 | Brooklyn | Bedford-Stuyvesant | Private room | 59 | 4 | 1 | 1.00 | 2 | 15 |
| 4 | Brooklyn | Park Slope | Entire home/apt | 200 | 2 | 2 | 0.05 | 1 | 0 |
| 5 | Manhattan | West Village | Private room | 100 | 4 | 1 | 0.04 | 1 | 0 |
| 6 | Brooklyn | Bay Ridge | Entire home/apt | 200 | 2 | 5 | 2.42 | 2 | 161 |
| 7 | Brooklyn | Williamsburg | Private room | 75 | 2 | 1 | 0.02 | 1 | 0 |
| 8 | Brooklyn | Flatlands | Entire home/apt | 85 | 1 | 21 | 2.26 | 1 | 82 |
| 9 | Queens | Forest Hills | Entire home/apt | 100 | 1 | 1 | 0.02 | 1 | 0 |
The DataFrame above is the one we will use for the statistical modelling in this investigation. Now let's check whether the data types of the variables are as expected. All variables' data types are as expected.
df_airbnb.dtypes
neighbourhood_group                object
neighbourhood                      object
room_type                          object
price_per_night                     int64
minimum_nights_to_book              int64
number_of_reviews                   int64
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object
We will now display the variables we are going to be using in the Regression Model.
For Independent Variables:
columns = df_airbnb.loc[:, df_airbnb.columns != 'price_per_night'] # Storing all Independent Variables
columns.columns
Index(['neighbourhood_group', 'neighbourhood', 'room_type',
'minimum_nights_to_book', 'number_of_reviews', 'reviews_per_month',
'calculated_host_listings_count', 'availability_365'],
dtype='object')
For the Dependent Variable:
columns = df_airbnb.loc[:, df_airbnb.columns == 'price_per_night'] # Storing the dependent (target) variable
columns.columns
Index(['price_per_night'], dtype='object')
One-Hot-Encoding
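Let's make a new copy of the original DataFrame. We will use this copy for one-hot-encoding the categorical columns in the dataset.
dfnew = df_airbnb.copy() # .copy() creates an independent DataFrame; plain assignment would only create a second name for the same object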
dfnew.head()
| | neighbourhood_group | neighbourhood | room_type | price_per_night | minimum_nights_to_book | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Manhattan | East Village | Entire home/apt | 151 | 2 | 214 | 2.53 | 3 | 256 |
| 1 | Manhattan | Lower East Side | Entire home/apt | 220 | 2 | 146 | 3.45 | 1 | 112 |
| 2 | Queens | Elmhurst | Private room | 41 | 2 | 44 | 5.24 | 1 | 123 |
| 3 | Brooklyn | Bedford-Stuyvesant | Private room | 59 | 4 | 1 | 1.00 | 2 | 15 |
| 4 | Brooklyn | Park Slope | Entire home/apt | 200 | 2 | 2 | 0.05 | 1 | 0 |
We will first re-check that the data types of all columns are correct and as expected. All numerical variables must be int or float, and all categorical variables must be of object type.
This looks correct and data-types of columns are as expected.
dfnew.dtypes
neighbourhood_group                object
neighbourhood                      object
room_type                          object
price_per_night                     int64
minimum_nights_to_book              int64
number_of_reviews                   int64
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object
We will now use Python's join function, which allows us to concatenate strings. We will extract the column names and concatenate them, with the "+" symbol separating them.
form_deps = ' + '.join(dfnew.columns.drop('price_per_night')) # Exclude the target so it does not appear on both sides of the formula
form_str = 'price_per_night ~ ' + form_deps
print(form_str)
price_per_night ~ neighbourhood_group + neighbourhood + room_type + minimum_nights_to_book + number_of_reviews + reviews_per_month + calculated_host_listings_count + availability_365
We now have the data we want to one-hot-encode. We will encode it using the get_dummies() function, which automatically one-hot-encodes multiple categorical variables.
We also had to modify the column names so that they could be used in our OLS regression formula. We did this by removing spaces, apostrophes, dashes and dots from the names. This ensures that no errors occur when the OLS regression model is fitted.
df_airbnb_encoded = pd.get_dummies(dfnew, drop_first = True)
df_airbnb_encoded.columns = df_airbnb_encoded.columns.str.replace(' ', '') # Remove spaces
df_airbnb_encoded.columns = df_airbnb_encoded.columns.str.replace("'", "") # Remove apostrophes
df_airbnb_encoded.columns = df_airbnb_encoded.columns.str.replace("-", "") # Remove dashes
df_airbnb_encoded.columns = df_airbnb_encoded.columns.str.replace(".", "", regex = False) # Remove dots; regex = False so "." is a literal dot, not a regex wildcard
df_airbnb_encoded = df_airbnb_encoded.rename(columns = {"neighbourhood_Prince'sBay":"neighbourhood_PrincesBay"}) # Safeguard only: the apostrophe was already removed above, so this rename has no effect
df_airbnb_encoded.head()
| | price_per_night | minimum_nights_to_book | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 | neighbourhood_group_Brooklyn | neighbourhood_group_Manhattan | neighbourhood_group_Queens | neighbourhood_group_StatenIsland | neighbourhood_Arrochar | neighbourhood_Arverne | neighbourhood_Astoria | neighbourhood_BathBeach | neighbourhood_BatteryParkCity | neighbourhood_BayRidge | neighbourhood_BayTerrace | neighbourhood_Baychester | neighbourhood_Bayside | neighbourhood_Bayswater | neighbourhood_BedfordStuyvesant | neighbourhood_BelleHarbor | neighbourhood_Belmont | neighbourhood_Bensonhurst | neighbourhood_BergenBeach | neighbourhood_BoerumHill | neighbourhood_BoroughPark | neighbourhood_Briarwood | neighbourhood_BrightonBeach | neighbourhood_Bronxdale | neighbourhood_BrooklynHeights | neighbourhood_Brownsville | neighbourhood_Bushwick | neighbourhood_CambriaHeights | neighbourhood_Canarsie | neighbourhood_CarrollGardens | neighbourhood_Chelsea | neighbourhood_Chinatown | neighbourhood_CityIsland | neighbourhood_CivicCenter | neighbourhood_ClasonPoint | neighbourhood_Clifton | neighbourhood_ClintonHill | neighbourhood_CobbleHill | neighbourhood_CollegePoint | neighbourhood_ColumbiaSt | neighbourhood_Concord | neighbourhood_Concourse | neighbourhood_ConcourseVillage | neighbourhood_ConeyIsland | neighbourhood_Corona | neighbourhood_CrownHeights | neighbourhood_CypressHills | neighbourhood_DUMBO | neighbourhood_DitmarsSteinway | neighbourhood_DonganHills | neighbourhood_DowntownBrooklyn | neighbourhood_DykerHeights | neighbourhood_EastElmhurst | neighbourhood_EastFlatbush | neighbourhood_EastHarlem | neighbourhood_EastMorrisania | neighbourhood_EastNewYork | neighbourhood_EastVillage | neighbourhood_Elmhurst | neighbourhood_EmersonHill | neighbourhood_FarRockaway | neighbourhood_Fieldston | neighbourhood_FinancialDistrict | neighbourhood_Flatbush | neighbourhood_FlatironDistrict | neighbourhood_Flatlands | neighbourhood_Flushing | neighbourhood_Fordham | neighbourhood_ForestHills | neighbourhood_FortGreene | neighbourhood_FortHamilton | neighbourhood_FreshMeadows | neighbourhood_Glendale | neighbourhood_Gowanus | neighbourhood_Gramercy | neighbourhood_Gravesend | neighbourhood_GreatKills | neighbourhood_Greenpoint | neighbourhood_GreenwichVillage | neighbourhood_GrymesHill | neighbourhood_Harlem | neighbourhood_HellsKitchen | neighbourhood_Highbridge | neighbourhood_Hollis | neighbourhood_HowardBeach | neighbourhood_HuntsPoint | neighbourhood_Inwood | neighbourhood_JacksonHeights | neighbourhood_Jamaica | neighbourhood_JamaicaEstates | neighbourhood_JamaicaHills | neighbourhood_Kensington | neighbourhood_KewGardens | neighbourhood_KewGardensHills | neighbourhood_Kingsbridge | neighbourhood_KipsBay | neighbourhood_Laurelton | neighbourhood_LittleItaly | neighbourhood_LongIslandCity | neighbourhood_Longwood | neighbourhood_LowerEastSide | neighbourhood_MarbleHill | neighbourhood_MarinersHarbor | neighbourhood_Maspeth | neighbourhood_Melrose | neighbourhood_MiddleVillage | neighbourhood_MidlandBeach | neighbourhood_Midtown | neighbourhood_Midwood | neighbourhood_MorningsideHeights | neighbourhood_MorrisHeights | neighbourhood_MorrisPark | neighbourhood_Morrisania | neighbourhood_MottHaven | neighbourhood_MountEden | neighbourhood_MountHope | neighbourhood_MurrayHill | neighbourhood_NavyYard | neighbourhood_NewBrighton | neighbourhood_NewSpringville | neighbourhood_NoHo | neighbourhood_Nolita | neighbourhood_NorthRiverdale | neighbourhood_Norwood | neighbourhood_Oakwood | neighbourhood_OzonePark | neighbourhood_ParkSlope | neighbourhood_Parkchester | neighbourhood_PelhamGardens | neighbourhood_PortMorris | neighbourhood_PortRichmond | neighbourhood_PrincesBay | neighbourhood_ProspectHeights | neighbourhood_ProspectLeffertsGardens | neighbourhood_QueensVillage | neighbourhood_RandallManor | neighbourhood_RedHook | neighbourhood_RegoPark | neighbourhood_RichmondHill | neighbourhood_Ridgewood | neighbourhood_RockawayBeach | neighbourhood_RooseveltIsland | neighbourhood_Rosebank | neighbourhood_Rosedale | neighbourhood_SeaGate | neighbourhood_SheepsheadBay | neighbourhood_SoHo | neighbourhood_SouthBeach | neighbourhood_SouthOzonePark | neighbourhood_SouthSlope | neighbourhood_SpringfieldGardens | neighbourhood_StAlbans | neighbourhood_StGeorge | neighbourhood_Stapleton | neighbourhood_StuyvesantTown | neighbourhood_Sunnyside | neighbourhood_SunsetPark | neighbourhood_TheaterDistrict | neighbourhood_ThrogsNeck | neighbourhood_TodtHill | neighbourhood_Tompkinsville | neighbourhood_Tottenville | neighbourhood_Tremont | neighbourhood_Tribeca | neighbourhood_TwoBridges | neighbourhood_UniversityHeights | neighbourhood_UpperEastSide | neighbourhood_UpperWestSide | neighbourhood_VanNest | neighbourhood_VinegarHill | neighbourhood_Wakefield | neighbourhood_WashingtonHeights | neighbourhood_WestBrighton | neighbourhood_WestFarms | neighbourhood_WestVillage | neighbourhood_WestchesterSquare | neighbourhood_Whitestone | neighbourhood_Williamsbridge | neighbourhood_Williamsburg | neighbourhood_WindsorTerrace | neighbourhood_Woodhaven | neighbourhood_Woodlawn | neighbourhood_Woodside | room_type_Privateroom | room_type_Sharedroom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 151 | 2 | 214 | 2.53 | 3 | 256 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 220 | 2 | 146 | 3.45 | 1 | 112 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 41 | 2 | 44 | 5.24 | 1 | 123 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 59 | 4 | 1 | 1.00 | 2 | 15 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 200 | 2 | 2 | 0.05 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
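The encoded table has no Bronx column and no "Entire home/apt" column. That is consistent with drop_first = True dropping one level of each categorical variable, which then acts as the baseline against which the other levels are measured. The pure-Python sketch below illustrates the idea; `dummy_code` is a hypothetical helper written for this explanation, not pandas' implementation, and it assumes the alphabetically first level is the one dropped.

```python
from typing import Dict, List


def dummy_code(values: List[str], prefix: str) -> List[Dict[str, int]]:
    """One-hot encode `values`, dropping the alphabetically first level as baseline."""
    levels = sorted(set(values))
    kept_levels = levels[1:]  # levels[0] becomes the reference category
    return [
        {f"{prefix}_{level}": int(value == level) for level in kept_levels}
        for value in values
    ]


boroughs = ["Manhattan", "Queens", "Bronx", "Brooklyn"]
rows = dummy_code(boroughs, prefix="neighbourhood_group")
for borough, row in zip(boroughs, rows):
    print(borough, row)
```

A listing in the baseline level (Bronx here) is represented by zeros in every indicator column, so each fitted coefficient for the other levels is read as a difference from that baseline.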
We will now extract and concatenate the column names of the encoded DataFrame and store the result in the form_deps_encoded variable. We will then concatenate form_deps_encoded with the name of our target variable (price_per_night).
form_deps_encoded = ' + '.join(df_airbnb_encoded.columns)
form_deps_encoded = form_deps_encoded.replace('price_per_night + ', '')
form_str_encoded = 'price_per_night ~ ' + form_deps_encoded
form_str_encoded
'price_per_night ~ minimum_nights_to_book + number_of_reviews + reviews_per_month + calculated_host_listings_count + availability_365 + neighbourhood_group_Brooklyn + neighbourhood_group_Manhattan + neighbourhood_group_Queens + neighbourhood_group_StatenIsland + neighbourhood_Arrochar + neighbourhood_Arverne + neighbourhood_Astoria + neighbourhood_BathBeach + neighbourhood_BatteryParkCity + neighbourhood_BayRidge + neighbourhood_BayTerrace + neighbourhood_Baychester + neighbourhood_Bayside + neighbourhood_Bayswater + neighbourhood_BedfordStuyvesant + neighbourhood_BelleHarbor + neighbourhood_Belmont + neighbourhood_Bensonhurst + neighbourhood_BergenBeach + neighbourhood_BoerumHill + neighbourhood_BoroughPark + neighbourhood_Briarwood + neighbourhood_BrightonBeach + neighbourhood_Bronxdale + neighbourhood_BrooklynHeights + neighbourhood_Brownsville + neighbourhood_Bushwick + neighbourhood_CambriaHeights + neighbourhood_Canarsie + neighbourhood_CarrollGardens + neighbourhood_Chelsea + neighbourhood_Chinatown + neighbourhood_CityIsland + neighbourhood_CivicCenter + neighbourhood_ClasonPoint + neighbourhood_Clifton + neighbourhood_ClintonHill + neighbourhood_CobbleHill + neighbourhood_CollegePoint + neighbourhood_ColumbiaSt + neighbourhood_Concord + neighbourhood_Concourse + neighbourhood_ConcourseVillage + neighbourhood_ConeyIsland + neighbourhood_Corona + neighbourhood_CrownHeights + neighbourhood_CypressHills + neighbourhood_DUMBO + neighbourhood_DitmarsSteinway + neighbourhood_DonganHills + neighbourhood_DowntownBrooklyn + neighbourhood_DykerHeights + neighbourhood_EastElmhurst + neighbourhood_EastFlatbush + neighbourhood_EastHarlem + neighbourhood_EastMorrisania + neighbourhood_EastNewYork + neighbourhood_EastVillage + neighbourhood_Elmhurst + neighbourhood_EmersonHill + neighbourhood_FarRockaway + neighbourhood_Fieldston + neighbourhood_FinancialDistrict + neighbourhood_Flatbush + neighbourhood_FlatironDistrict + neighbourhood_Flatlands + neighbourhood_Flushing + 
neighbourhood_Fordham + neighbourhood_ForestHills + neighbourhood_FortGreene + neighbourhood_FortHamilton + neighbourhood_FreshMeadows + neighbourhood_Glendale + neighbourhood_Gowanus + neighbourhood_Gramercy + neighbourhood_Gravesend + neighbourhood_GreatKills + neighbourhood_Greenpoint + neighbourhood_GreenwichVillage + neighbourhood_GrymesHill + neighbourhood_Harlem + neighbourhood_HellsKitchen + neighbourhood_Highbridge + neighbourhood_Hollis + neighbourhood_HowardBeach + neighbourhood_HuntsPoint + neighbourhood_Inwood + neighbourhood_JacksonHeights + neighbourhood_Jamaica + neighbourhood_JamaicaEstates + neighbourhood_JamaicaHills + neighbourhood_Kensington + neighbourhood_KewGardens + neighbourhood_KewGardensHills + neighbourhood_Kingsbridge + neighbourhood_KipsBay + neighbourhood_Laurelton + neighbourhood_LittleItaly + neighbourhood_LongIslandCity + neighbourhood_Longwood + neighbourhood_LowerEastSide + neighbourhood_MarbleHill + neighbourhood_MarinersHarbor + neighbourhood_Maspeth + neighbourhood_Melrose + neighbourhood_MiddleVillage + neighbourhood_MidlandBeach + neighbourhood_Midtown + neighbourhood_Midwood + neighbourhood_MorningsideHeights + neighbourhood_MorrisHeights + neighbourhood_MorrisPark + neighbourhood_Morrisania + neighbourhood_MottHaven + neighbourhood_MountEden + neighbourhood_MountHope + neighbourhood_MurrayHill + neighbourhood_NavyYard + neighbourhood_NewBrighton + neighbourhood_NewSpringville + neighbourhood_NoHo + neighbourhood_Nolita + neighbourhood_NorthRiverdale + neighbourhood_Norwood + neighbourhood_Oakwood + neighbourhood_OzonePark + neighbourhood_ParkSlope + neighbourhood_Parkchester + neighbourhood_PelhamGardens + neighbourhood_PortMorris + neighbourhood_PortRichmond + neighbourhood_PrincesBay + neighbourhood_ProspectHeights + neighbourhood_ProspectLeffertsGardens + neighbourhood_QueensVillage + neighbourhood_RandallManor + neighbourhood_RedHook + neighbourhood_RegoPark + neighbourhood_RichmondHill + neighbourhood_Ridgewood + 
neighbourhood_RockawayBeach + neighbourhood_RooseveltIsland + neighbourhood_Rosebank + neighbourhood_Rosedale + neighbourhood_SeaGate + neighbourhood_SheepsheadBay + neighbourhood_SoHo + neighbourhood_SouthBeach + neighbourhood_SouthOzonePark + neighbourhood_SouthSlope + neighbourhood_SpringfieldGardens + neighbourhood_StAlbans + neighbourhood_StGeorge + neighbourhood_Stapleton + neighbourhood_StuyvesantTown + neighbourhood_Sunnyside + neighbourhood_SunsetPark + neighbourhood_TheaterDistrict + neighbourhood_ThrogsNeck + neighbourhood_TodtHill + neighbourhood_Tompkinsville + neighbourhood_Tottenville + neighbourhood_Tremont + neighbourhood_Tribeca + neighbourhood_TwoBridges + neighbourhood_UniversityHeights + neighbourhood_UpperEastSide + neighbourhood_UpperWestSide + neighbourhood_VanNest + neighbourhood_VinegarHill + neighbourhood_Wakefield + neighbourhood_WashingtonHeights + neighbourhood_WestBrighton + neighbourhood_WestFarms + neighbourhood_WestVillage + neighbourhood_WestchesterSquare + neighbourhood_Whitestone + neighbourhood_Williamsbridge + neighbourhood_Williamsburg + neighbourhood_WindsorTerrace + neighbourhood_Woodhaven + neighbourhood_Woodlawn + neighbourhood_Woodside + room_type_Privateroom + room_type_Sharedroom'
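The name clean-up and formula construction can also be sketched without pandas. This is an illustrative alternative under stated assumptions, not the notebook's code: `sanitise` and `build_formula` are hypothetical helpers, and the raw column names are examples in the style that get_dummies would produce. Filtering the target out by name does not depend on its position, whereas replacing the text 'price_per_night + ' only works when the target is not the last column.

```python
import re
from typing import Iterable, List


def sanitise(name: str) -> str:
    """Remove spaces, apostrophes, dashes and dots from a column name."""
    return re.sub(r"[ '\-.]", "", name)


def build_formula(target: str, columns: Iterable[str]) -> str:
    """Return 'target ~ a + b + ...' using every column except the target."""
    predictors: List[str] = [col for col in columns if col != target]
    return f"{target} ~ " + " + ".join(predictors)


# Example raw names (assumed for illustration).
raw_columns = [
    "price_per_night",
    "availability_365",
    "neighbourhood_Bedford-Stuyvesant",
    "neighbourhood_Prince's Bay",
    "room_type_Private room",
]
clean_columns = [sanitise(col) for col in raw_columns]
formula = build_formula("price_per_night", clean_columns)
print(formula)
```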
Our Take on One-Hot-Encoding Results for Neighbourhood Column
We have observed above that one-hot-encoding the neighbourhood column produces hundreds of extra variables. We recognise this and considered clustering the neighbourhoods; however, this is not really possible given the nature of the dataset, because the neighbourhood_group variable is already a "cluster" of its respective neighbourhoods.
So we are left with two options: either we drop the neighbourhood column, as it may negatively affect our multiple regression model, or we keep it if it gives us a higher r-squared value.
To test this we will produce two OLS regression models, one without neighbourhoods and one with neighbourhoods, and compare their r-squared values.
We have now gathered the formula and the DataFrame needed for the OLS regression. Using this formula and DataFrame, we will now fit an OLS (ordinary least squares) model to our encoded data.
OLS Model Without Neighbourhoods
First we will begin by fitting the OLS Model without Neighbourhoods variable, to see what r-squared value is achieved.
For simplicity's sake, and to keep the notebook uncluttered, we will put all of the code in a single cell and display only the r-squared value.
dfnew_without_neighbourhoods = df_airbnb
dfnew_without_neighbourhoods = dfnew_without_neighbourhoods.drop(columns = 'neighbourhood')
form_deps_without_neighbourhoods = ' + '.join(dfnew_without_neighbourhoods.columns)
form_str_without_neighbourhoods = 'price_per_night ~ ' + form_deps_without_neighbourhoods
df_airbnb_encoded_without_neighbourhoods = pd.get_dummies(dfnew_without_neighbourhoods, drop_first = True)
df_airbnb_encoded_without_neighbourhoods.columns = df_airbnb_encoded_without_neighbourhoods.columns.str.replace(' ', '') # Remove spaces
df_airbnb_encoded_without_neighbourhoods.columns = df_airbnb_encoded_without_neighbourhoods.columns.str.replace("'", "") # Remove apostrophes
df_airbnb_encoded_without_neighbourhoods.columns = df_airbnb_encoded_without_neighbourhoods.columns.str.replace("-", "") # Remove dashes
df_airbnb_encoded_without_neighbourhoods.columns = df_airbnb_encoded_without_neighbourhoods.columns.str.replace(".", "", regex = False) # Remove dots; regex = False so "." is a literal dot, not a regex wildcard
form_deps_encoded_without_neighbourhoods = ' + '.join(df_airbnb_encoded_without_neighbourhoods.columns)
form_deps_encoded_without_neighbourhoods = form_deps_encoded_without_neighbourhoods.replace('price_per_night + ', '')
form_str_encoded_without_neighbourhoods = 'price_per_night ~ ' + form_deps_encoded_without_neighbourhoods
form_str_encoded_without_neighbourhoods
model_full_without_neighbourhoods = sm.formula.ols(formula = form_str_encoded_without_neighbourhoods, data = df_airbnb_encoded_without_neighbourhoods) # Creates an OLS regression model from the formula and DataFrame
model_full_fitted_without_neighbourhoods = model_full_without_neighbourhoods.fit()
print('R-Squared Value is :', round(model_full_fitted_without_neighbourhoods.rsquared,3))
R-Squared Value is : 0.459
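Plain r-squared cannot decrease when predictors are added, so a model with well over a hundred extra neighbourhood indicators will always score at least as high as the one without them. Adjusted r-squared penalises the number of predictors and gives a fairer comparison. The sketch below applies the standard formula to the rounded values reported in this section; `adjusted_r_squared` is a hypothetical helper, and the predictor count of 11 for the model without neighbourhoods is derived here from its columns (5 numerical features, 4 borough indicators, 2 room-type indicators).

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Without neighbourhoods: R-squared 0.459 with 11 predictors (derived count).
adj_without = adjusted_r_squared(0.459, n_obs=5000, n_predictors=11)

# With neighbourhoods: the summary further down reports R-squared 0.541, Df Model 186.
adj_with = adjusted_r_squared(0.541, n_obs=5000, n_predictors=186)

print(round(adj_without, 3), round(adj_with, 3))
```

Because the inputs are rounded to three decimals, the second result only approximates the adjusted r-squared of 0.524 shown in the full summary. Even after the penalty, the model with neighbourhoods still scores higher.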
OLS Model With Neighbourhoods
We will now include the neighbourhood column, to see what r-squared value is achieved and how it compares with the model above.
model_full = sm.formula.ols(formula = form_str_encoded, data = df_airbnb_encoded) # Creates an OLS regression model from the formula and DataFrame
model_full_fitted = model_full.fit()
print(model_full_fitted.summary())
mlr_reg = model_full_fitted # Store the fitted full model under a shorter name for convenience
OLS Regression Results
==============================================================================
Dep. Variable: price_per_night R-squared: 0.541
Model: OLS Adj. R-squared: 0.524
Method: Least Squares F-statistic: 30.56
Date: Sat, 23 Oct 2021 Prob (F-statistic): 0.00
Time: 15:37:11 Log-Likelihood: -25979.
No. Observations: 5000 AIC: 5.233e+04
Df Residuals: 4813 BIC: 5.355e+04
Df Model: 186
Covariance Type: nonrobust
=========================================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------------------
Intercept 107.8850 5.494 19.637 0.000 97.114 118.656
minimum_nights_to_book -0.2428 0.054 -4.490 0.000 -0.349 -0.137
number_of_reviews -0.0670 0.017 -4.003 0.000 -0.100 -0.034
reviews_per_month 0.8121 0.505 1.609 0.108 -0.178 1.802
calculated_host_listings_count 0.5484 1.131 0.485 0.628 -1.668 2.765
availability_365 0.0828 0.006 14.290 0.000 0.071 0.094
neighbourhood_group_Brooklyn 21.5653 5.714 3.774 0.000 10.363 32.767
neighbourhood_group_Manhattan 62.0573 5.564 11.152 0.000 51.148 72.966
neighbourhood_group_Queens 14.2124 6.040 2.353 0.019 2.371 26.054
neighbourhood_group_StatenIsland -18.5316 31.132 -0.595 0.552 -79.564 42.501
neighbourhood_Arrochar -43.2163 54.555 -0.792 0.428 -150.168 63.736
neighbourhood_Arverne 16.1165 13.544 1.190 0.234 -10.436 42.669
neighbourhood_Astoria 14.9301 5.616 2.658 0.008 3.920 25.941
neighbourhood_BathBeach -15.1108 25.290 -0.598 0.550 -64.690 34.469
neighbourhood_BatteryParkCity -8.7188 21.750 -0.401 0.689 -51.359 33.921
neighbourhood_BayRidge -18.4989 12.379 -1.494 0.135 -42.767 5.769
neighbourhood_BayTerrace 35.7787 30.970 1.155 0.248 -24.937 96.494
neighbourhood_Baychester -12.4006 43.731 -0.284 0.777 -98.134 73.333
neighbourhood_Bayside 9.0866 30.961 0.293 0.769 -51.610 69.783
neighbourhood_Bayswater 28.6414 25.363 1.129 0.259 -21.082 78.365
neighbourhood_BedfordStuyvesant 0.7246 3.398 0.213 0.831 -5.938 7.387
neighbourhood_BelleHarbor 121.5316 43.670 2.783 0.005 35.918 207.145
neighbourhood_Belmont -25.8859 25.615 -1.011 0.312 -76.102 24.330
neighbourhood_Bensonhurst 0.4139 15.629 0.026 0.979 -30.227 31.055
neighbourhood_BergenBeach -34.3040 43.642 -0.786 0.432 -119.862 51.254
neighbourhood_BoerumHill 23.8391 9.461 2.520 0.012 5.291 42.387
neighbourhood_BoroughPark -9.7760 19.677 -0.497 0.619 -48.352 28.800
neighbourhood_Briarwood -19.8588 15.747 -1.261 0.207 -50.730 11.013
neighbourhood_BrightonBeach -16.8661 17.999 -0.937 0.349 -52.152 18.420
neighbourhood_Bronxdale -20.2497 25.597 -0.791 0.429 -70.431 29.932
neighbourhood_BrooklynHeights 26.2611 10.341 2.540 0.011 5.988 46.534
neighbourhood_Brownsville -30.7926 15.630 -1.970 0.049 -61.434 -0.151
neighbourhood_Bushwick -2.7582 3.799 -0.726 0.468 -10.207 4.690
neighbourhood_CambriaHeights -15.8954 30.963 -0.513 0.608 -76.597 44.806
neighbourhood_Canarsie -18.6083 11.219 -1.659 0.097 -40.603 3.386
neighbourhood_CarrollGardens 41.0742 10.097 4.068 0.000 21.279 60.870
neighbourhood_Chelsea 15.7677 5.098 3.093 0.002 5.774 25.762
neighbourhood_Chinatown 7.0211 7.380 0.951 0.341 -7.446 21.488
neighbourhood_CityIsland -20.7297 31.143 -0.666 0.506 -81.784 40.325
neighbourhood_CivicCenter -0.6052 21.708 -0.028 0.978 -43.163 41.953
neighbourhood_ClasonPoint -17.7612 31.158 -0.570 0.569 -78.845 43.323
neighbourhood_Clifton -3.1451 44.535 -0.071 0.944 -90.454 84.164
neighbourhood_ClintonHill 8.8816 6.302 1.409 0.159 -3.473 21.237
neighbourhood_CobbleHill 23.3045 14.764 1.578 0.115 -5.640 52.249
neighbourhood_CollegePoint -45.6952 30.967 -1.476 0.140 -106.405 15.014
neighbourhood_ColumbiaSt 25.7862 19.660 1.312 0.190 -12.756 64.328
neighbourhood_Concord 49.7316 54.535 0.912 0.362 -57.183 156.646
neighbourhood_Concourse 39.4742 22.323 1.768 0.077 -4.290 83.238
neighbourhood_ConcourseVillage 28.9527 18.480 1.567 0.117 -7.276 65.181
neighbourhood_ConeyIsland -19.1322 31.005 -0.617 0.537 -79.917 41.652
neighbourhood_Corona -25.8978 25.351 -1.022 0.307 -75.598 23.802
neighbourhood_CrownHeights -5.6673 4.101 -1.382 0.167 -13.707 2.373
neighbourhood_CypressHills -24.4422 9.887 -2.472 0.013 -43.826 -5.058
neighbourhood_DUMBO 30.8540 25.284 1.220 0.222 -18.713 80.421
neighbourhood_DitmarsSteinway 11.6493 7.569 1.539 0.124 -3.189 26.488
neighbourhood_DonganHills 0.9088 54.595 0.017 0.987 -106.122 107.939
neighbourhood_DowntownBrooklyn 3.4698 12.377 0.280 0.779 -20.795 27.735
neighbourhood_DykerHeights -2.4403 43.921 -0.056 0.956 -88.545 83.664
neighbourhood_EastElmhurst 0.9511 11.381 0.084 0.933 -21.361 23.263
neighbourhood_EastFlatbush -20.3061 6.769 -3.000 0.003 -33.577 -7.035
neighbourhood_EastHarlem -21.4428 4.429 -4.842 0.000 -30.125 -12.761
neighbourhood_EastMorrisania 26.4284 43.752 0.604 0.546 -59.346 112.203
neighbourhood_EastNewYork -13.6214 10.611 -1.284 0.199 -34.423 7.180
neighbourhood_EastVillage 5.9101 3.912 1.511 0.131 -1.759 13.579
neighbourhood_Elmhurst -6.1711 10.525 -0.586 0.558 -26.805 14.463
neighbourhood_EmersonHill -18.0748 54.599 -0.331 0.741 -125.114 88.964
neighbourhood_FarRockaway -38.6635 22.024 -1.756 0.079 -81.841 4.514
neighbourhood_Fieldston 50.2545 43.718 1.150 0.250 -35.454 135.963
neighbourhood_FinancialDistrict 10.9602 7.147 1.534 0.125 -3.051 24.972
neighbourhood_Flatbush -14.7031 5.589 -2.631 0.009 -25.661 -3.746
neighbourhood_FlatironDistrict 6.4798 19.444 0.333 0.739 -31.640 44.600
neighbourhood_Flatlands -16.3341 15.627 -1.045 0.296 -46.971 14.303
neighbourhood_Flushing 2.7514 10.084 0.273 0.785 -17.019 22.522
neighbourhood_Fordham -19.1121 20.108 -0.950 0.342 -58.533 20.309
neighbourhood_ForestHills -4.7970 11.384 -0.421 0.673 -27.114 17.520
neighbourhood_FortGreene 27.2721 6.215 4.388 0.000 15.089 39.455
neighbourhood_FortHamilton -2.3798 14.761 -0.161 0.872 -31.318 26.559
neighbourhood_FreshMeadows -8.4472 31.011 -0.272 0.785 -69.243 52.349
neighbourhood_Glendale 18.6397 14.162 1.316 0.188 -9.124 46.403
neighbourhood_Gowanus 14.3232 8.654 1.655 0.098 -2.642 31.289
neighbourhood_Gramercy 10.0059 7.061 1.417 0.157 -3.836 23.848
neighbourhood_Gravesend -5.9445 19.653 -0.302 0.762 -44.474 32.585
neighbourhood_GreatKills 25.0659 54.574 0.459 0.646 -81.924 132.056
neighbourhood_Greenpoint 17.6386 4.865 3.625 0.000 8.100 27.177
neighbourhood_GreenwichVillage 25.4541 6.426 3.961 0.000 12.857 38.051
neighbourhood_GrymesHill 66.4389 54.534 1.218 0.223 -40.473 173.351
neighbourhood_Harlem -26.0413 3.449 -7.550 0.000 -32.803 -19.279
neighbourhood_HellsKitchen 17.5984 4.216 4.174 0.000 9.333 25.863
neighbourhood_Highbridge -10.4671 43.781 -0.239 0.811 -96.298 75.363
neighbourhood_Hollis -22.3157 43.666 -0.511 0.609 -107.920 63.289
neighbourhood_HowardBeach -6.0133 30.970 -0.194 0.846 -66.729 54.703
neighbourhood_HuntsPoint -10.9800 43.736 -0.251 0.802 -96.724 74.764
neighbourhood_Inwood -55.8205 7.661 -7.286 0.000 -70.840 -40.801
neighbourhood_JacksonHeights -1.4031 9.316 -0.151 0.880 -19.667 16.861
neighbourhood_Jamaica -15.3975 9.179 -1.677 0.094 -33.393 2.598
neighbourhood_JamaicaEstates -16.7907 25.348 -0.662 0.508 -66.484 32.903
neighbourhood_JamaicaHills 15.7750 43.664 0.361 0.718 -69.826 101.375
neighbourhood_Kensington -11.3327 13.405 -0.845 0.398 -37.612 14.947
neighbourhood_KewGardens -24.5923 19.748 -1.245 0.213 -63.307 14.123
neighbourhood_KewGardensHills -42.0643 25.358 -1.659 0.097 -91.777 7.648
neighbourhood_Kingsbridge -1.3082 18.484 -0.071 0.944 -37.545 34.929
neighbourhood_KipsBay 13.9519 6.602 2.113 0.035 1.008 26.896
neighbourhood_Laurelton 42.8014 22.030 1.943 0.052 -0.388 85.991
neighbourhood_LittleItaly -3.5091 13.849 -0.253 0.800 -30.660 23.642
neighbourhood_LongIslandCity 18.2052 6.356 2.864 0.004 5.745 30.666
neighbourhood_Longwood -2.6312 25.607 -0.103 0.918 -52.833 47.571
neighbourhood_LowerEastSide -0.2148 4.669 -0.046 0.963 -9.369 8.939
neighbourhood_MarbleHill -82.1627 43.222 -1.901 0.057 -166.897 2.572
neighbourhood_MarinersHarbor -24.2362 54.557 -0.444 0.657 -131.193 82.721
neighbourhood_Maspeth -1.3105 14.162 -0.093 0.926 -29.074 26.453
neighbourhood_Melrose -35.7889 43.717 -0.819 0.413 -121.494 49.916
neighbourhood_MiddleVillage -29.2612 25.357 -1.154 0.249 -78.972 20.450
neighbourhood_MidlandBeach 4.9590 54.547 0.091 0.928 -101.979 111.897
neighbourhood_Midtown 12.4780 4.995 2.498 0.013 2.686 22.270
neighbourhood_Midwood -15.4023 13.398 -1.150 0.250 -41.668 10.864
neighbourhood_MorningsideHeights -16.0941 7.871 -2.045 0.041 -31.525 -0.663
neighbourhood_MorrisHeights 25.5600 31.136 0.821 0.412 -35.481 86.601
neighbourhood_MorrisPark -8.4237 31.182 -0.270 0.787 -69.555 52.707
neighbourhood_Morrisania -3.9323 31.135 -0.126 0.900 -64.971 57.106
neighbourhood_MottHaven -7.6655 14.691 -0.522 0.602 -36.466 21.135
neighbourhood_MountEden 21.3451 31.130 0.686 0.493 -39.684 82.374
neighbourhood_MountHope -13.3447 22.327 -0.598 0.550 -57.116 30.426
neighbourhood_MurrayHill 25.5327 9.725 2.626 0.009 6.468 44.597
neighbourhood_NavyYard -7.2168 30.916 -0.233 0.815 -67.826 53.392
neighbourhood_NewBrighton 22.7212 54.560 0.416 0.677 -84.242 129.685
neighbourhood_NewSpringville -0.3360 44.556 -0.008 0.994 -87.685 87.013
neighbourhood_NoHo 14.4164 15.438 0.934 0.350 -15.850 44.683
neighbourhood_Nolita 26.6030 9.493 2.803 0.005 7.993 45.213
neighbourhood_NorthRiverdale -13.2070 31.148 -0.424 0.672 -74.272 47.858
neighbourhood_Norwood 35.9643 31.224 1.152 0.249 -25.249 97.178
neighbourhood_Oakwood 19.4358 44.538 0.436 0.663 -67.879 106.751
neighbourhood_OzonePark -12.4507 19.763 -0.630 0.529 -51.196 26.294
neighbourhood_ParkSlope 31.6122 5.883 5.374 0.000 20.080 43.145
neighbourhood_Parkchester -14.6732 44.010 -0.333 0.739 -100.953 71.606
neighbourhood_PelhamGardens -5.7565 25.602 -0.225 0.822 -55.948 44.435
neighbourhood_PortMorris 3.3026 31.145 0.106 0.916 -57.755 64.361
neighbourhood_PortRichmond 6.7598 54.595 0.124 0.901 -100.272 113.792
neighbourhood_PrincesBay 96.2310 54.534 1.765 0.078 -10.681 203.143
neighbourhood_ProspectHeights 19.6413 6.951 2.826 0.005 6.014 33.268
neighbourhood_ProspectLeffertsGardens -7.7989 6.441 -1.211 0.226 -20.426 4.829
neighbourhood_QueensVillage -11.8803 15.745 -0.755 0.451 -42.748 18.988
neighbourhood_RandallManor -26.8381 44.562 -0.602 0.547 -114.199 60.523
neighbourhood_RedHook -15.3648 13.413 -1.146 0.252 -41.661 10.931
neighbourhood_RegoPark 3.7952 10.068 0.377 0.706 -15.943 23.533
neighbourhood_RichmondHill 40.6415 19.750 2.058 0.040 1.922 79.361
neighbourhood_Ridgewood 8.7227 7.568 1.153 0.249 -6.115 23.560
neighbourhood_RockawayBeach 17.3353 19.752 0.878 0.380 -21.387 56.057
neighbourhood_RooseveltIsland -12.8114 13.234 -0.968 0.333 -38.757 13.134
neighbourhood_Rosebank 30.7048 54.558 0.563 0.574 -76.255 137.664
neighbourhood_Rosedale -13.7512 16.796 -0.819 0.413 -46.678 19.176
neighbourhood_SeaGate -18.2255 43.633 -0.418 0.676 -103.766 67.315
neighbourhood_SheepsheadBay -28.9727 11.220 -2.582 0.010 -50.968 -6.977
neighbourhood_SoHo 24.1644 9.299 2.599 0.009 5.934 42.395
neighbourhood_SouthBeach 2.6563 54.650 0.049 0.961 -104.483 109.796
neighbourhood_SouthOzonePark -10.2438 18.130 -0.565 0.572 -45.787 25.299
neighbourhood_SouthSlope 14.5719 6.824 2.135 0.033 1.193 27.951
neighbourhood_SpringfieldGardens 7.0427 14.952 0.471 0.638 -22.269 36.355
neighbourhood_StAlbans 5.7945 16.789 0.345 0.730 -27.119 38.708
neighbourhood_StGeorge 41.9671 36.377 1.154 0.249 -29.349 113.283
neighbourhood_Stapleton 23.7763 40.724 0.584 0.559 -56.061 103.614
neighbourhood_StuyvesantTown -8.9320 19.443 -0.459 0.646 -47.049 29.185
neighbourhood_Sunnyside -2.8786 7.904 -0.364 0.716 -18.374 12.616
neighbourhood_SunsetPark 3.1351 6.534 0.480 0.631 -9.675 15.945
neighbourhood_TheaterDistrict 28.7703 14.575 1.974 0.048 0.196 57.344
neighbourhood_ThrogsNeck -15.2741 43.714 -0.349 0.727 -100.974 70.426
neighbourhood_TodtHill -46.8885 54.552 -0.860 0.390 -153.835 60.058
neighbourhood_Tompkinsville 2.3143 36.379 0.064 0.949 -69.005 73.634
neighbourhood_Tottenville -40.1714 54.556 -0.736 0.462 -147.125 66.782
neighbourhood_Tremont 3.2888 31.219 0.105 0.916 -57.914 64.492
neighbourhood_Tribeca 56.2443 16.484 3.412 0.001 23.928 88.561
neighbourhood_TwoBridges 11.9165 13.901 0.857 0.391 -15.336 39.169
neighbourhood_UniversityHeights 19.3666 43.719 0.443 0.658 -66.343 105.076
neighbourhood_UpperEastSide -8.1384 4.164 -1.955 0.051 -16.301 0.024
neighbourhood_UpperWestSide -2.1566 3.962 -0.544 0.586 -9.924 5.611
neighbourhood_VanNest 9.4110 43.741 0.215 0.830 -76.341 95.163
neighbourhood_VinegarHill 41.2733 21.941 1.881 0.060 -1.741 84.287
neighbourhood_Wakefield 20.7533 43.716 0.475 0.635 -64.949 106.456
neighbourhood_WashingtonHeights -38.4443 4.883 -7.873 0.000 -48.017 -28.871
neighbourhood_WestBrighton -1.0233 40.689 -0.025 0.980 -80.792 78.745
neighbourhood_WestFarms 31.1300 43.731 0.712 0.477 -54.604 116.863
neighbourhood_WestVillage 33.8745 5.198 6.516 0.000 23.683 44.066
neighbourhood_WestchesterSquare 11.1090 43.720 0.254 0.799 -74.601 96.819
neighbourhood_Whitestone -6.3495 31.006 -0.205 0.838 -67.136 54.437
neighbourhood_Williamsbridge -50.2995 31.183 -1.613 0.107 -111.432 10.833
neighbourhood_Williamsburg 33.4823 3.359 9.967 0.000 26.896 40.068
neighbourhood_WindsorTerrace 10.0062 9.866 1.014 0.311 -9.336 29.348
neighbourhood_Woodhaven -16.8088 15.748 -1.067 0.286 -47.682 14.064
neighbourhood_Woodlawn 12.1322 25.638 0.473 0.636 -38.130 62.394
neighbourhood_Woodside -7.0403 10.525 -0.669 0.504 -27.673 13.593
room_type_Privateroom -72.5125 1.365 -53.105 0.000 -75.189 -69.836
room_type_Sharedroom -95.8231 5.425 -17.664 0.000 -106.458 -85.188
==============================================================================
Omnibus: 871.666 Durbin-Watson: 1.989
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1945.906
Skew: 1.005 Prob(JB): 0.00
Kurtosis: 5.302 Cond. No. 3.91e+18
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 7.84e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
Summary of OLS Regression Results
With Neighbourhoods vs Without Neighbourhoods:
The model containing the neighbourhood variables had a higher R-squared value (0.541) than the model without them (0.459). We will therefore use only the OLS model with the neighbourhood variables, the one with an R-squared of 0.541, for the rest of the investigation.
Analysis of OLS Regression Results
The OLS regression results show an R-squared of 0.541. This indicates that 54.1% of the variation in price (price_per_night) is explained by the independent variables displayed above. Looking at the p-values, the room-type terms and several of the borough and neighbourhood terms are significant, but a large share of the individual neighbourhood terms are not significant at the 5% level.
Furthermore, Prob (F-statistic) is reported as 0.00, indicating that the regression as a whole is statistically significant.
Setting up Residuals
actual_predicted_residual = pd.DataFrame({'actual': df_airbnb_encoded['price_per_night'],
'predicted': mlr_reg.fittedvalues,
'residual': mlr_reg.resid})
actual_predicted_residual.head(7)
| | actual | predicted | residual |
|---|---|---|---|
| 0 | 151 | 185.942453 | -34.942453 |
| 1 | 220 | 172.093146 | 47.906854 |
| 2 | 41 | 54.974036 | -13.974036 |
| 3 | 59 | 59.775382 | -0.775382 |
| 4 | 200 | 161.031925 | 38.968075 |
| 5 | 100 | 130.846824 | -30.846824 |
| 6 | 200 | 126.529173 | 73.470827 |
The table above shows the predicted Airbnb prices alongside the residuals, which measure how far each actual price deviates from its predicted price under the full regression model.
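To make the link between the residual column and the R-squared figure concrete, the sketch below recomputes both from the seven rows shown in the table. The helper name `r_squared` is our own and is not part of statsmodels. With only seven rows the result will not equal the model's 0.541, which is computed over the full sample, so this is purely illustrative.

```python
# Illustrative only: the seven (actual, predicted) pairs from the table above.
actual = [151, 220, 41, 59, 200, 100, 200]
predicted = [185.942453, 172.093146, 54.974036, 59.775382,
             161.031925, 130.846824, 126.529173]


def r_squared(y, y_hat):
    """Return 1 - SS_res / SS_tot for paired actual and fitted values."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot


# A residual is simply the actual value minus the predicted value.
residuals = [a - p for a, p in zip(actual, predicted)]
print([round(r, 6) for r in residuals])
print(round(r_squared(actual, predicted), 3))
```

The printed residuals should agree with the residual column of the table, which confirms that the column is actual minus predicted.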
sns.regplot(x="actual",
y="predicted",
ci=None,
data =actual_predicted_residual,
color = "g",
scatter_kws={'alpha':0.3}).set_title('Scatter plot of Prices vs Predicted Prices for the Full Model with Regression Line', fontsize=15);
sns.scatterplot(x="actual",
y="predicted",
data =actual_predicted_residual,
color = "g",
alpha = 0.3).set_title('Scatter plot of Prices vs Predicted Prices for the Full Model without Regression Line', fontsize=15);
What we observe from the above graph is a generally linear relationship between the actual and predicted prices of the listings up to about the 150 dollar mark, after which it fades out and plateaus for the rest of the graph. The data are densest in the 50 to 200 dollar region, where the majority of the predictions fall.
Multiple linear regression relies on four conditions, which we check in turn: a linear relationship between the predictors and the response, residuals with constant variability, normally distributed residuals, and independent residuals.
Condition 1 : Linearity
sns.scatterplot(x = actual_predicted_residual['predicted'], y = actual_predicted_residual['residual'], alpha = 0.3, color = 'g').set_title('Figure 1A : Scatter plot of Residuals vs Predicted Prices for the Full Model', fontsize=15);
sns.regplot(x="predicted",
y="residual",
data =actual_predicted_residual,
color = "g",
scatter_kws={'alpha':0.3}).set_title('Figure 1B : Scatter plot of Residuals vs Predicted Prices for the Full Model with Regression Line', fontsize=15);
The scatter plot and the fitted line through the residuals suggest that the linearity condition is not fully met: the points do not form a horizontal band around zero, which points to a non-linear relationship. The residuals also reveal outliers in the dataset, as some points lie well away from the main pattern, such as (50, 200) and (170, -200).
Condition 2 : Constant Variability
sns.scatterplot(x = actual_predicted_residual['actual'], y = actual_predicted_residual['residual'], color="g", alpha = 0.3).set_title('Figure 2B : Actual Prices against Residuals');
There is a general upwards trend in the residuals of the predicted listing prices, with the majority of the prices and residuals densely located below the 150 dollar mark. However, a general trend of constant variability can be observed, with prominent horizontal streaks at visibly equal intervals, potentially a consequence of the dataset's preprocessing.
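A visual check of constant variability can be backed up with a rough numerical one. The sketch below is a minimal illustration on synthetic data, not the Airbnb data: it splits the observations into equal-count bins of predicted value and reports the residual spread in each bin. The function name `spread_by_bin` and the choice of four bins are our own assumptions.

```python
import random
import statistics


def spread_by_bin(predicted, residuals, n_bins=4):
    """Split observations into equal-count bins of predicted value and
    return (mean residual, residual standard deviation) for each bin."""
    pairs = sorted(zip(predicted, residuals))
    size = len(pairs) // n_bins
    out = []
    for i in range(n_bins):
        start = i * size
        # The last bin absorbs any leftover observations.
        end = len(pairs) if i == n_bins - 1 else start + size
        res = [r for _, r in pairs[start:end]]
        out.append((statistics.mean(res), statistics.pstdev(res)))
    return out


# Synthetic data: the noise grows with the predicted value, which is the
# fan shape that violates the constant variability condition.
random.seed(0)
pred = [random.uniform(50, 250) for _ in range(2000)]
resid = [random.gauss(0, 0.2 * p) for p in pred]

for mean_r, sd_r in spread_by_bin(pred, resid):
    print(f"mean residual {mean_r:8.2f}   spread {sd_r:8.2f}")
```

If the spread is similar across bins, the condition is plausible; a spread that grows steadily from the first bin to the last, as in this synthetic example, indicates non-constant variability.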
Condition 3 : Normality
residuals = mlr_reg.resid
probplot = st.probplot(residuals,fit=True,plot=plt)
plt.title('Figure 3A : Normal Probability Plot of Residuals')
plt.ylabel('Residuals')
plt.show();
While the normal probability plot of the residuals shows small deviations from the reference line, indicating minor irregularities, only a few outliers skew the residual distribution. There does appear to be one substantially elongated tail; however, the distribution remains roughly normal and centred at 0.
sns.histplot(x = model_full_fitted.resid, bins = 40, color = "g").set(title = 'Figure 3B : Distribution of Residuals from Full Fitted Model');
plt.xlabel('Residuals')
plt.show();
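The residuals of the full fitted regression model appear roughly normally distributed, so the model supports the normality assumption to an extent. The right skew is also visible in the summary statistics above (Skew: 1.005, Kurtosis: 5.302), and both the Omnibus and Jarque-Bera tests report a probability of 0.00, so normality holds only approximately.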
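The Jarque-Bera figure in the regression summary can be reproduced from the reported skewness and kurtosis, which shows how the skewed tail feeds into the formal normality test. The sketch below uses the standard formula JB = n/6 * (S^2 + (K - 3)^2 / 4). The function name `jarque_bera` is our own, and the sample size of 5000 for the full model is an assumption taken from the reduced-model summary, which is fitted on the same data frame.

```python
def jarque_bera(n, skew, kurtosis):
    """Jarque-Bera statistic from the sample size, the skewness and the
    (non-excess) kurtosis of the residuals."""
    return n / 6 * (skew ** 2 + (kurtosis - 3) ** 2 / 4)


# Full model: skew and kurtosis as printed in its summary.
# n = 5000 is assumed to match the reduced-model summary.
jb_full = jarque_bera(n=5000, skew=1.005, kurtosis=5.302)

# Reduced model: all three inputs are printed in its summary.
jb_reduced = jarque_bera(n=5000, skew=0.995, kurtosis=5.169)

print(round(jb_full, 1), round(jb_reduced, 1))
```

Both values land within about 0.3 of the reported statistics (1945.906 and 1804.885); the small gap comes from using skewness and kurtosis rounded to three decimals. For normal residuals the statistic would be close to zero, so values this large explain the reported Prob(JB) of 0.00.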
Condition 4 : Statistical Independence
Since the times at which the observations were made are not included in our dataset, and we don't have any spatial variables, we can't plot a "time of observation vs. residual" plot. However, we are reasonably confident that the observations in the model are independent because of how the data were sampled. Our full dataset included all listings in New York City, and we randomly sampled from it, so the sampled listings come from random different places and should be independent of each other. The Durbin-Watson statistic of 1.989 in the summary above is close to 2, which is consistent with this.
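The Durbin-Watson statistic printed in the regression summary can be computed directly from the residuals. The sketch below is a minimal illustration on synthetic residuals, not the Airbnb residuals, and the function name `durbin_watson` is our own. Note that the statistic only measures correlation between neighbouring rows, so for randomly sampled rows it is a weak check rather than a proof of independence.

```python
import random


def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Synthetic, independent residuals.
random.seed(1)
independent = [random.gauss(0, 1) for _ in range(5000)]

# Positively autocorrelated series: each value carries over 0.8 of the last.
correlated = [independent[0]]
for e in independent[1:]:
    correlated.append(0.8 * correlated[-1] + e)

print(round(durbin_watson(independent), 2))
print(round(durbin_watson(correlated), 2))
```

Independent residuals give a value near 2, while positive serial correlation pulls the statistic towards 0 and negative serial correlation pushes it towards 4.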
Backwards Feature Selection
Credit for this code is given in the references below.
## create the patsy model description from formula
patsy_description = patsy.ModelDesc.from_formula(form_str_encoded)
# initialize feature-selected fit to full model
linreg_fit = model_full_fitted
# do backwards elimination using p-values
p_val_cutoff = 0.05
## WARNING 1: The code below assumes that the Intercept term is present in the model.
## WARNING 2: It will work only with main effects and two-way interactions, if any.
print('\nPerforming backwards feature selection using p-values:')
while True:
    # consider only the p-values of the non-intercept terms
    pval_series = linreg_fit.pvalues.drop(labels='Intercept')
    pval_series = pval_series.sort_values(ascending=False)
    # the term with the largest p-value is the removal candidate
    term = pval_series.index[0]
    pval = pval_series.iloc[0]
    if (pval < p_val_cutoff):
        break
    term_components = term.split(':')
    print(f'\nRemoving term "{term}" with p-value {pval:.4}')
    if (len(term_components) == 1): ## this is a main effect term
        patsy_description.rhs_termlist.remove(patsy.Term([patsy.EvalFactor(term_components[0])]))
    else: ## this is an interaction term
        patsy_description.rhs_termlist.remove(patsy.Term([patsy.EvalFactor(term_components[0]),
                                                        patsy.EvalFactor(term_components[1])]))
    # refit the model without the removed term
    linreg_fit = smf.ols(formula=patsy_description, data=df_airbnb_encoded).fit()
###
## this is the clean fit after backwards elimination
model_reduced_fitted = smf.ols(formula = patsy_description, data = df_airbnb_encoded).fit()
###
#########
print("\n***")
print(model_reduced_fitted.summary())
print("***")
print(f"Regression number of terms: {len(model_reduced_fitted.model.exog_names)}")
print(f"Regression F-distribution p-value: {model_reduced_fitted.f_pvalue:.4f}")
print(f"Regression R-squared: {model_reduced_fitted.rsquared:.4f}")
print(f"Regression Adjusted R-squared: {model_reduced_fitted.rsquared_adj:.4f}")
Performing backwards feature selection using p-values:
Removing term "neighbourhood_NewSpringville" with p-value 0.994
Removing term "neighbourhood_DonganHills" with p-value 0.9828
Removing term "neighbourhood_Bensonhurst" with p-value 0.9789
Removing term "neighbourhood_BedfordStuyvesant" with p-value 0.9844
Removing term "neighbourhood_CivicCenter" with p-value 0.9777
Removing term "neighbourhood_LowerEastSide" with p-value 0.9863
Removing term "neighbourhood_WestBrighton" with p-value 0.9737
Removing term "neighbourhood_SouthBeach" with p-value 0.9493
Removing term "neighbourhood_DykerHeights" with p-value 0.9439
Removing term "neighbourhood_Kingsbridge" with p-value 0.943
Removing term "neighbourhood_Longwood" with p-value 0.9665
Removing term "neighbourhood_Morrisania" with p-value 0.9498
Removing term "neighbourhood_EastElmhurst" with p-value 0.9299
Removing term "neighbourhood_Clifton" with p-value 0.9285
Removing term "neighbourhood_MidlandBeach" with p-value 0.905
Removing term "neighbourhood_Tompkinsville" with p-value 0.9127
Removing term "neighbourhood_Flushing" with p-value 0.9028
Removing term "neighbourhood_PelhamGardens" with p-value 0.9009
Removing term "neighbourhood_PortRichmond" with p-value 0.8945
Removing term "neighbourhood_RegoPark" with p-value 0.8806
Removing term "neighbourhood_MorrisPark" with p-value 0.87
Removing term "neighbourhood_Highbridge" with p-value 0.8811
Removing term "neighbourhood_HuntsPoint" with p-value 0.8789
Removing term "neighbourhood_Baychester" with p-value 0.8607
Removing term "neighbourhood_MottHaven" with p-value 0.8698
Removing term "neighbourhood_StAlbans" with p-value 0.8592
Removing term "neighbourhood_Bayside" with p-value 0.8482
Removing term "neighbourhood_Parkchester" with p-value 0.8464
Removing term "neighbourhood_ThrogsNeck" with p-value 0.839
Removing term "neighbourhood_FortHamilton" with p-value 0.8361
Removing term "neighbourhood_NorthRiverdale" with p-value 0.8336
Removing term "neighbourhood_LittleItaly" with p-value 0.8226
Removing term "neighbourhood_DowntownBrooklyn" with p-value 0.8213
Removing term "neighbourhood_SpringfieldGardens" with p-value 0.8065
Removing term "neighbourhood_NavyYard" with p-value 0.801
Removing term "neighbourhood_JamaicaHills" with p-value 0.7854
Removing term "neighbourhood_MountHope" with p-value 0.7807
Removing term "neighbourhood_HowardBeach" with p-value 0.7581
Removing term "neighbourhood_Whitestone" with p-value 0.755
Removing term "neighbourhood_Maspeth" with p-value 0.7572
Removing term "neighbourhood_ClasonPoint" with p-value 0.7505
Removing term "neighbourhood_UpperWestSide" with p-value 0.7452
Removing term "neighbourhood_BatteryParkCity" with p-value 0.7468
Removing term "neighbourhood_Gravesend" with p-value 0.7394
Removing term "neighbourhood_PortMorris" with p-value 0.7239
Removing term "neighbourhood_Tremont" with p-value 0.7356
Removing term "neighbourhood_FreshMeadows" with p-value 0.722
Removing term "neighbourhood_VanNest" with p-value 0.7147
Removing term "neighbourhood_StuyvesantTown" with p-value 0.7121
Removing term "neighbourhood_SunsetPark" with p-value 0.6954
Removing term "neighbourhood_JacksonHeights" with p-value 0.6936
Removing term "neighbourhood_WestchesterSquare" with p-value 0.693
Removing term "neighbourhood_FlatironDistrict" with p-value 0.6827
Removing term "neighbourhood_EmersonHill" with p-value 0.6745
Removing term "neighbourhood_SeaGate" with p-value 0.667
Removing term "neighbourhood_CityIsland" with p-value 0.6508
Removing term "neighbourhood_NewBrighton" with p-value 0.6185
Removing term "calculated_host_listings_count" with p-value 0.6135
Removing term "neighbourhood_Bronxdale" with p-value 0.6094
Removing term "neighbourhood_Hollis" with p-value 0.5879
Removing term "neighbourhood_Sunnyside" with p-value 0.5971
Removing term "neighbourhood_ForestHills" with p-value 0.6505
Removing term "neighbourhood_CambriaHeights" with p-value 0.6117
Removing term "neighbourhood_GreatKills" with p-value 0.5871
Removing term "neighbourhood_Oakwood" with p-value 0.5947
Removing term "neighbourhood_BoroughPark" with p-value 0.587
Removing term "neighbourhood_SouthOzonePark" with p-value 0.5857
Removing term "neighbourhood_Elmhurst" with p-value 0.6016
Removing term "neighbourhood_Fordham" with p-value 0.5732
Removing term "neighbourhood_group_StatenIsland" with p-value 0.5847
Removing term "neighbourhood_Rosebank" with p-value 0.6171
Removing term "neighbourhood_OzonePark" with p-value 0.57
Removing term "neighbourhood_Woodside" with p-value 0.587
Removing term "neighbourhood_Melrose" with p-value 0.565
Removing term "neighbourhood_JamaicaEstates" with p-value 0.5609
Removing term "neighbourhood_Belmont" with p-value 0.556
Removing term "neighbourhood_QueensVillage" with p-value 0.5386
Removing term "neighbourhood_BathBeach" with p-value 0.5384
Removing term "neighbourhood_ConeyIsland" with p-value 0.5244
Removing term "neighbourhood_Stapleton" with p-value 0.5126
Removing term "neighbourhood_Rosedale" with p-value 0.5096
Removing term "neighbourhood_UniversityHeights" with p-value 0.5067
Removing term "neighbourhood_Wakefield" with p-value 0.4921
Removing term "neighbourhood_MarinersHarbor" with p-value 0.4592
Removing term "neighbourhood_BergenBeach" with p-value 0.4363
Removing term "neighbourhood_RooseveltIsland" with p-value 0.4012
Removing term "neighbourhood_EastMorrisania" with p-value 0.4008
Removing term "neighbourhood_Woodlawn" with p-value 0.3993
Removing term "neighbourhood_Woodhaven" with p-value 0.3785
Removing term "neighbourhood_Kensington" with p-value 0.3772
Removing term "neighbourhood_Corona" with p-value 0.3734
Removing term "neighbourhood_Concord" with p-value 0.3687
Removing term "neighbourhood_WestFarms" with p-value 0.3704
Removing term "neighbourhood_MountEden" with p-value 0.3536
Removing term "neighbourhood_BrightonBeach" with p-value 0.3533
Removing term "neighbourhood_Bushwick" with p-value 0.3807
Removing term "neighbourhood_Flatlands" with p-value 0.3229
Removing term "neighbourhood_TwoBridges" with p-value 0.3142
Removing term "neighbourhood_NoHo" with p-value 0.3132
Removing term "neighbourhood_MiddleVillage" with p-value 0.3087
Removing term "neighbourhood_Briarwood" with p-value 0.3022
Removing term "neighbourhood_KewGardens" with p-value 0.3034
Removing term "neighbourhood_MorrisHeights" with p-value 0.2974
Removing term "neighbourhood_Midwood" with p-value 0.2877
Removing term "neighbourhood_RedHook" with p-value 0.2873
Removing term "neighbourhood_ProspectLeffertsGardens" with p-value 0.2972
Removing term "neighbourhood_Chinatown" with p-value 0.2792
Removing term "neighbourhood_RockawayBeach" with p-value 0.2765
Removing term "neighbourhood_CrownHeights" with p-value 0.2696
Removing term "neighbourhood_EastNewYork" with p-value 0.2965
Removing term "neighbourhood_Tottenville" with p-value 0.2456
Removing term "neighbourhood_RandallManor" with p-value 0.2333
Removing term "neighbourhood_Arrochar" with p-value 0.2294
Removing term "neighbourhood_Jamaica" with p-value 0.2212
Removing term "neighbourhood_GrymesHill" with p-value 0.2061
Removing term "neighbourhood_BayRidge" with p-value 0.2028
Removing term "neighbourhood_TodtHill" with p-value 0.1997
Removing term "neighbourhood_CollegePoint" with p-value 0.1964
Removing term "neighbourhood_Fieldston" with p-value 0.1898
Removing term "neighbourhood_BayTerrace" with p-value 0.1891
Removing term "neighbourhood_DUMBO" with p-value 0.1865
Removing term "neighbourhood_WindsorTerrace" with p-value 0.1881
Removing term "neighbourhood_Bayswater" with p-value 0.1845
Removing term "neighbourhood_Williamsbridge" with p-value 0.1829
Removing term "neighbourhood_Norwood" with p-value 0.1601
Removing term "neighbourhood_ColumbiaSt" with p-value 0.1583
Removing term "neighbourhood_Canarsie" with p-value 0.1553
Removing term "reviews_per_month" with p-value 0.1513
Removing term "neighbourhood_Gramercy" with p-value 0.1426
Removing term "neighbourhood_EastVillage" with p-value 0.1781
Removing term "neighbourhood_FinancialDistrict" with p-value 0.2028
Removing term "neighbourhood_FarRockaway" with p-value 0.1384
Removing term "neighbourhood_KewGardensHills" with p-value 0.1414
Removing term "neighbourhood_Arverne" with p-value 0.1077
Removing term "neighbourhood_StGeorge" with p-value 0.1032
Removing term "neighbourhood_Glendale" with p-value 0.1024
Removing term "neighbourhood_KipsBay" with p-value 0.08761
Removing term "neighbourhood_TheaterDistrict" with p-value 0.0894
Removing term "neighbourhood_Ridgewood" with p-value 0.08519
Removing term "neighbourhood_CobbleHill" with p-value 0.08212
Removing term "neighbourhood_Brownsville" with p-value 0.07408
Removing term "neighbourhood_PrincesBay" with p-value 0.06263
Removing term "neighbourhood_ConcourseVillage" with p-value 0.06695
Removing term "neighbourhood_ClintonHill" with p-value 0.05876
Removing term "neighbourhood_Gowanus" with p-value 0.05739
Removing term "neighbourhood_VinegarHill" with p-value 0.05789
Removing term "neighbourhood_Midtown" with p-value 0.05678
Removing term "neighbourhood_Concourse" with p-value 0.05552
Removing term "neighbourhood_DitmarsSteinway" with p-value 0.05357
***
OLS Regression Results
==============================================================================
Dep. Variable: price_per_night R-squared: 0.529
Model: OLS Adj. R-squared: 0.525
Method: Least Squares F-statistic: 136.0
Date: Sat, 23 Oct 2021 Prob (F-statistic): 0.00
Time: 15:37:49 Log-Likelihood: -26045.
No. Observations: 5000 AIC: 5.217e+04
Df Residuals: 4958 BIC: 5.245e+04
Df Model: 41
Covariance Type: nonrobust
====================================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------------
Intercept 107.2251 4.115 26.059 0.000 99.158 115.292
minimum_nights_to_book -0.2439 0.053 -4.569 0.000 -0.349 -0.139
number_of_reviews -0.0481 0.014 -3.470 0.001 -0.075 -0.021
availability_365 0.0802 0.006 14.378 0.000 0.069 0.091
neighbourhood_group_Brooklyn 22.2173 4.148 5.356 0.000 14.086 30.349
neighbourhood_group_Manhattan 68.9129 4.269 16.143 0.000 60.544 77.282
neighbourhood_group_Queens 15.6305 4.498 3.475 0.001 6.812 24.450
neighbourhood_Astoria 15.7012 5.104 3.076 0.002 5.696 25.707
neighbourhood_BelleHarbor 123.6851 44.520 2.778 0.005 36.406 210.964
neighbourhood_BoerumHill 25.1015 9.360 2.682 0.007 6.753 43.451
neighbourhood_BrooklynHeights 27.2770 10.280 2.653 0.008 7.123 47.431
neighbourhood_CarrollGardens 42.5645 10.027 4.245 0.000 22.907 62.222
neighbourhood_Chelsea 10.8835 4.918 2.213 0.027 1.242 20.525
neighbourhood_CypressHills -21.7989 9.797 -2.225 0.026 -41.005 -2.593
neighbourhood_EastFlatbush -18.3889 6.478 -2.839 0.005 -31.089 -5.689
neighbourhood_EastHarlem -26.1634 4.179 -6.260 0.000 -34.357 -17.970
neighbourhood_Flatbush -13.3998 5.193 -2.580 0.010 -23.581 -3.219
neighbourhood_FortGreene 28.5941 5.878 4.865 0.000 17.070 40.118
neighbourhood_Greenpoint 18.8710 4.352 4.336 0.000 10.338 27.404
neighbourhood_GreenwichVillage 20.4119 6.358 3.210 0.001 7.947 32.876
neighbourhood_Harlem -30.8547 3.044 -10.136 0.000 -36.822 -24.887
neighbourhood_HellsKitchen 13.0781 3.932 3.326 0.001 5.370 20.786
neighbourhood_Inwood -60.9523 7.673 -7.943 0.000 -75.995 -45.909
neighbourhood_Laurelton 45.0863 22.335 2.019 0.044 1.299 88.873
neighbourhood_LongIslandCity 18.7285 5.932 3.157 0.002 7.099 30.358
neighbourhood_MarbleHill -87.7979 44.479 -1.974 0.048 -174.996 -0.600
neighbourhood_MorningsideHeights -20.9939 7.897 -2.659 0.008 -36.475 -5.513
neighbourhood_MurrayHill 20.5472 9.844 2.087 0.037 1.248 39.847
neighbourhood_Nolita 21.6354 9.604 2.253 0.024 2.808 40.463
neighbourhood_ParkSlope 32.6893 5.509 5.934 0.000 21.890 43.488
neighbourhood_ProspectHeights 20.6724 6.677 3.096 0.002 7.582 33.763
neighbourhood_RichmondHill 41.2103 19.998 2.061 0.039 2.005 80.415
neighbourhood_SheepsheadBay -26.6033 11.185 -2.378 0.017 -48.531 -4.676
neighbourhood_SoHo 19.5233 9.399 2.077 0.038 1.097 37.950
neighbourhood_SouthSlope 15.7757 6.541 2.412 0.016 2.953 28.598
neighbourhood_Tribeca 50.6766 16.877 3.003 0.003 17.590 83.763
neighbourhood_UpperEastSide -12.8783 3.876 -3.322 0.001 -20.478 -5.279
neighbourhood_WashingtonHeights -43.2636 4.686 -9.232 0.000 -52.451 -34.077
neighbourhood_WestVillage 28.7847 5.025 5.728 0.000 18.933 38.636
neighbourhood_Williamsburg 34.7668 2.459 14.137 0.000 29.945 39.588
room_type_Privateroom -72.4730 1.305 -55.529 0.000 -75.032 -69.914
room_type_Sharedroom -96.0520 5.206 -18.451 0.000 -106.258 -85.846
==============================================================================
Omnibus: 845.441 Durbin-Watson: 1.995
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1804.885
Skew: 0.995 Prob(JB): 0.00
Kurtosis: 5.169 Cond. No. 1.10e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
***
Regression number of terms: 42
Regression F-distribution p-value: 0.0000
Regression R-squared: 0.5293
Regression Adjusted R-squared: 0.5254
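As a sanity check on the reduced-model output, the adjusted R-squared and the overall F-statistic can be recomputed from the R-squared, the number of observations and the number of predictors using the standard formulas. The sketch below uses only values printed in the summary above; the function names are our own.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p predictors
    (the intercept is not counted in p)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def f_statistic(r2, n, p):
    """Overall F-statistic of the regression, derived from its R-squared."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))


# Values read from the reduced-model summary:
# No. Observations = 5000, Df Model = 41, R-squared = 0.5293.
n_obs, n_predictors, r2 = 5000, 41, 0.5293

print(round(adjusted_r_squared(r2, n_obs, n_predictors), 4))
print(round(f_statistic(r2, n_obs, n_predictors), 1))
```

The two results agree with the reported Adjusted R-squared of 0.5254 and F-statistic of 136.0. The penalty for the number of predictors is small here because 41 terms is few relative to 5000 observations.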
Setting Up Residuals For Reduced Model
Below are the actual and predicted values from the reduced model, along with the residuals, which are the differences between them.
residuals_reduced = pd.DataFrame({'actual': df_airbnb_encoded['price_per_night'],
'predicted': model_reduced_fitted.fittedvalues,
'residual': model_reduced_fitted.resid})
residuals_reduced.head(7)
| | actual | predicted | residual |
|---|---|---|---|
| 0 | 151 | 185.874727 | -34.874727 |
| 1 | 220 | 177.603604 | 42.396396 |
| 2 | 41 | 57.637477 | -16.637477 |
| 3 | 59 | 57.148205 | 1.851795 |
| 4 | 200 | 161.547677 | 38.452323 |
| 5 | 100 | 131.426098 | -31.426098 |
| 6 | 200 | 141.619494 | 58.380506 |
# Actual vs. predicted prices with a fitted regression line
sns.regplot(x = residuals_reduced['actual'],
y = residuals_reduced['predicted'],
ci=None,
color = "g",
scatter_kws={'alpha':0.3}).set_title('Scatter plot of Prices vs Predicted Prices for the Reduced Model with Regression Line', fontsize=15);
plt.ylabel('Predicted prices')
plt.xlabel('Actual prices')
plt.show();
# Same plot on a fresh figure without the regression line (ci is not passed to scatterplot)
sns.scatterplot(x= residuals_reduced['actual'],
y= residuals_reduced['predicted'],
color = "g",
alpha = 0.3).set_title('Scatter plot of Prices vs Predicted Prices for the Reduced Model Without Regression Line', fontsize=15);
plt.ylabel('Predicted prices')
plt.xlabel('Actual prices')
plt.show();
This scatter plot of the actual prices against the predicted prices for the reduced regression model closely resembles the one for the full regression model, showing a similar but slightly more clearly defined linear pattern.
Checking conditions for Reduced Regression Model
sns.scatterplot(x = residuals_reduced['predicted'], y = residuals_reduced['residual'], color = "g").set(title = 'Figure 1A : Predicted values against Residuals');
plt.xlabel('Predicted prices')
plt.ylabel('Residuals')
plt.show();
This scatter plot of the predicted prices against the residuals for the reduced regression model closely resembles the one for the full regression model, showing a similar pattern that is slightly more linear and more clearly defined than in the previous scatter plot.
sns.histplot(x = residuals_reduced['residual'], bins = 40, color = "g").set_title("Figure 1B: Distribution of Residuals");
plt.xlabel('Residuals')
plt.show();
This histogram of the residuals for the reduced regression model closely resembles the one for the full regression model, with a similar right tail that makes it slightly right-skewed. It nonetheless maintains a roughly normal shape centred at 0.
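The right skew described here is also visible in the reduced-model summary (Skew 0.995, Kurtosis 5.169, Jarque-Bera 1804.885). As a rough cross-check, the Jarque-Bera statistic can be recomputed from the reported skew and kurtosis. The sketch below assumes n = 5000 observations, the sample size mentioned elsewhere in this report; the variable names are illustrative only.

```python
# Values copied from the reduced-model OLS summary
skew = 0.995
kurtosis = 5.169
n = 5000  # assumption: the 5000-row sample referred to in this report

# Jarque-Bera statistic: JB = n/6 * (S^2 + (K - 3)^2 / 4)
jb = n / 6 * (skew ** 2 + (kurtosis - 3) ** 2 / 4)
print(f"Recomputed JB: {jb:.1f} (summary reports 1804.885)")
```

A normal distribution has skew 0 and kurtosis 3, so a statistic this large indicates that the residuals depart from exact normality even though the histogram looks roughly bell-shaped.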
sns.scatterplot(x = residuals_reduced['actual'], y = residuals_reduced['residual'], color = "g").set(title = 'Figure 1C : Actual price against Residuals');
plt.xlabel('Actual prices')
plt.ylabel('Residuals')
plt.show();
This scatter plot shows the relationship between the residuals of the reduced model and the actual prices in the dataset. We still see a roughly constant spread of residuals, similar to the residual variance plot for the full model. The plot has barely changed, with most prices packed into the under-150 dollar range.
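A visual impression of constant spread can be backed up with a simple numerical comparison. The sketch below is one informal way to do this, not the method used in this report: it sorts observations by predicted value, splits them into four equal-sized groups, and compares the residual standard deviation in each group. The data here are synthetic; in the notebook, `residuals_reduced['predicted']` and `residuals_reduced['residual']` would take their place.

```python
import numpy as np

# Synthetic stand-ins for the predicted values and residuals (assumption)
rng = np.random.default_rng(1)
predicted = rng.uniform(50, 250, size=2000)
residual = rng.normal(0, 40, size=2000)  # constant variance by construction

# Sort by predicted value and split into four equal-sized groups
order = np.argsort(predicted)
groups = np.array_split(order, 4)

# Residual standard deviation within each group
spreads = [residual[idx].std(ddof=1) for idx in groups]
ratio = max(spreads) / min(spreads)

print("Residual std per group:", np.round(spreads, 1))
print("Largest-to-smallest ratio:", round(ratio, 2))
```

A ratio close to 1 is consistent with constant variance, while a ratio well above 1 would suggest that the spread changes with the predicted price.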
Overall, the approach fitted the model adequately using all the relevant features and variables from our dataset. It also removed the variables with a p-value over 0.05 from the full model effectively.
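The idea of removing variables with a p-value over 0.05 can be sketched as follows. This is a minimal illustration on synthetic data, not the template code used in this report. It uses only NumPy and approximates the p-values with a normal distribution instead of the t distribution, which is reasonable for large samples. The function names `ols_pvalues` and `backward_eliminate` are hypothetical.

```python
import math
import numpy as np

def ols_pvalues(X, y):
    """Two-sided p-values for OLS coefficients (normal approximation)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)           # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)      # coefficient covariance
    z = beta / np.sqrt(np.diag(cov))
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])

def backward_eliminate(X, y, names, alpha=0.05):
    """Repeatedly drop the least significant column until all p <= alpha."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        pvals = ols_pvalues(X[:, keep], y)
        # Column 0 is assumed to be the intercept and is never dropped
        worst = int(np.argmax(pvals[1:])) + 1
        if pvals[worst] <= alpha:
            break
        keep.pop(worst)
    return [names[i] for i in keep]

# Synthetic data: y depends on x1 and x2 but not on the noise column
rng = np.random.default_rng(0)
n = 500
x1, x2, noise = rng.normal(size=(3, n))
X = np.column_stack([np.ones(n), x1, x2, noise])
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

selected = backward_eliminate(X, y, ["const", "x1", "x2", "noise"])
print(selected)
```

Dropping one variable at a time and refitting matters because p-values change whenever a correlated variable leaves the model.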
A limitation of our approach may lie in how we preprocessed the data, which could have caused the vertical stripe patterns in our plots of the regression model. One possible explanation is that removing instances with certain values of a feature left only a limited set of values, making certain prices more likely and clustering them along those vertical lines.
The way we preprocessed the data in Phase 1, particularly the treatment of outliers, may also have flattered the residual plots, since the observed residuals from the regression model maintained a roughly constant variance.
Moreover, the reduced sample size relative to the large number of variables produced by one-hot encoding our 8 features may have diluted the predictions. The reduced sample may therefore have limited the predictive potential of the model, especially given the highly granular nature of one of our features (neighbourhood).
Furthermore, given the nature of our dataset (Airbnb listings in New York), spatial variables could have been used in addition to the categorical features. We avoided them out of concern for their complexity, but they may have offered greater predictive potential than the granular categorical features alone.
We began by removing features of our dataset that do not aid predictive modelling. We then handled missing values and treated outliers in certain columns, taking care not to harm the predictive quality of the model, since columns such as price, which contain large outliers, may be necessary for its accuracy. We fitted the cleaned data to an OLS regression model and compared the R-squared values with and without the neighbourhood feature to decide whether to keep it. We doubted the safety of using a column with hundreds of categorical values, but kept it because it produced a higher R-squared value. We then checked the assumptions of a multiple linear regression model by testing linearity, residual normality, and constant variance. We could not test for independence, but we had justification for why our observations were independent. Next, we performed backward feature selection using code from the sample template and examined the residuals. Finally, we checked the conditions for the reduced linear model and noted the changes.
The regression model based on our preprocessed data supported the regression assumptions, except for linearity: the plots showed some linear patterns but no clear structure. Our findings showed that most listing prices were below 200 dollars, and it was in this region that the residuals were closest to normal. Beyond the 200 dollar mark, the histogram was stretched towards the right, indicating greater variability for higher-priced listings.
After reducing the model through the backward feature selection process, we ended up with 41 features. The residual plots for the full and reduced models showed no visible difference, suggesting that the removed features contributed little to prediction. We believe these features were removed because they did not have enough data backing them up, given our limited sample size of 5000 and the nature of the dataset.
Moreover, the R-squared and adjusted R-squared values were 0.541 and 0.524 for the full model, and 0.529 and 0.525 for the reduced model. The adjusted R-squared increased by only 0.001, meaning that removing 145 features had little influence on the model's predictive power. This supports our observation that the dataset did not contain enough data to support the multitude of features created by one-hot encoding.
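The relationship between these figures can be checked with the standard adjusted R-squared formula, which penalises R-squared for the number of predictors. The sketch below assumes n = 5000 observations and takes the full model to have 41 + 145 = 186 predictors, as implied by the feature counts quoted in this report; both are assumptions and were not read from the model objects.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 5000  # assumption: sample size mentioned in the report

# Reduced model: 41 features, R-squared 0.5293 from the OLS summary
reduced = adjusted_r2(0.5293, n, 41)

# Full model: assumed 41 + 145 = 186 features, R-squared 0.541 as quoted
full = adjusted_r2(0.541, n, 41 + 145)

print(f"Reduced model adjusted R-squared: {reduced:.4f}")
print(f"Full model adjusted R-squared:    {full:.4f}")
```

The full model's larger penalty for its extra predictors explains why its adjusted R-squared is no higher than the reduced model's despite the higher raw R-squared.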
We have found that our New York Airbnb listings dataset includes features that significantly impact prices as well as features that have only a minor effect on our response variable. Our objective was to find the factors that affect the price of Airbnb listings the most and to use them to better predict listing prices. In addition, our predictive modelling was intended to support use cases such as aiding banking firms, helping market researchers, and improving price accuracy for consumers and Airbnb hosts alike. Overall, our predictive model was moderately accurate at predicting the prices of Airbnb listings, based on its R-squared value of roughly 0.53.
References